-
Notifications
You must be signed in to change notification settings - Fork 491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not unnecessarily retransmit commitment_signed
in dual funding
#1214
base: master
Are you sure you want to change the base?
Do not unnecessarily retransmit commitment_signed
in dual funding
#1214
Conversation
On reconnection in the middle of the dual-funding flow, if both nodes have exchanged the initial `commitment_signed` and node A had sent its (initial) `tx_signatures` but node B never received them, both nodes should send a `channel_reestablish` with `next_funding_txid` set and a `next_commitment_number` of 1 (as they've already received the commitment transaction for commitment number 0). The spec indicates in this case that both nodes should retransmit their `commitment_signed`, however, as this is only gated on `next_funding_txid` and not the `next_commitment_number` field. This may cause implementations which assume that each new `commitment_signed` is for a new state to fail and potentially fail the channel. Instead, we should rely both the presence of `next_funding_txid` *and* `next_commitment_number` being zero to decide if we need to resend our `commitment_signed`. Sadly, we cannot rely on just `next_commitment_number` as that is used to request a force-closure in a non-standard way to work around implementations not honoring the `error` message.
Sending a `channel_reestablish` with `next_commitment_number` of zero is used in practice to request a peer fail a channel and broadcast the latest state (for implementations which continue to ignore the `error` message). Because its used in practice we should document it to avoid accidentally writing up incompatible things in the future.
- if `next_commitment_number` is zero: | ||
- MUST retransmit its `commitment_signed` for that funding transaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that you didn't add a requirement that we MUST NOT retransmit commitment_signed
if next_commitment_number = 1
. That's a good thing, because this wouldn't be backwards-compatible, since eclair
and cln
currently always send next_commitment_number = 1
and will always retransmit commit_sig
if next_funding_txid
is included.
We previously discussed whether it was worth avoiding unnecessarily retransmitting commit_sig
in that case. We decided that since it was simple to ignore a spurious retransmission, it wasn't worth the extra effort. But if you feel that it's important to be cleaner here and avoid that retransmission, we can definitely add it, WDYT @ddustin @niftynei?
In order to introduce this without breaking too many nodes, I think we should deploy that change using the following steps:
- as the receiver of
channel_reestablish
, acceptnext_commitment_number = 0
ifnext_funding_txid
is set - (once enough nodes have upgraded with the behavior above) as the sender of
channel_reestablish
, setnext_commitment_number = 0
when asking to retransmitcommit_sig
- (once enough nodes have upgraded with the behavior above) as the receiver of
channel_reestablish
, don't retransmitcommit_sig
ifnext_commitment_number = 1
If we decide to go down that road, we must add the same mechanism for splicing: retransmitting commit_sig
when next_funding_txid
is set should be gated on next_commitment_number
being equal to the current commitment_number
. I'm implementing this in eclair to verify that this doesn't have any unintended side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, that was a bit of an oversight. I don't really see why you'd want to spuriously retransmit when its so trivial to avoid. I'm not sure about the current dual-funding implementation but in LDK more generally if you retransmit a commitment_signed post-open we'll definitely think its invalid and FC the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of upgrade, I don't see why we wouldn't just stop sending the spurious retransmit. Most nodes that support dual-funding will upgrade quickly but generally nodes should in practice handle the spurious retransmit for a bit, but I don't really think it needs to be around that long :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to handle backwards-compatibility, because there are two implementations that already shipped with the current retransmit behavior and it's actively used on mainnet. It doesn't have to take a year, but we must coordinate releases with cln
if we make this change and allow a grace period where we retransmit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see why you'd want to spuriously retransmit when its so trivial to avoid.
It's even more trivial to ignore the spurious retransmit and not have to deal with thinking about whether to retransmit or not (and always send commit_sig
before tx_signatures
), which is what we chose to currently implement...to be honest it's unclear to me why it's a real problem for you (apart from "we can avoid it")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that for splicing this is slightly more complex:
- our channel is at
local_commitment_number = N
- if we disconnect, when reconnecting, we would set
next_commitment_number = N+1
- we start splicing and disconnect after sending our
commit_sig
but before receiving the remotecommit_sig
- on reconnection, if we want to avoid spurious retransmits, we need to set
next_funding_txid
andnext_commitment_number = N
, right? Because we're missing our peer'scommit_sig
for commitment N that spends the new splice transaction - with what we currently implemented, we didn't make any change to the
channel_reestablish
logic and would be sendingnext_commitment_number = N + 1
, and if we sentnext_commitment_number = N
our peer would currently think we're late (because we're not looking at whethernext_funding_txid
is set in that case, which would remove the confusion)
So we need to more carefully manage our deployment to ensure we're not force-closing channels in the process 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As was discussed during this week's spec meeting, since this only matters if peers disconnect in the middle of the signing flow, which should be extremely rare for server nodes, we probably don't need to care that much about phasing deployments between eclair
and cln
, and can directly implement this PR. We need a sign-off from cln
though, @ddustin @niftynei does that sound good to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protocol is as written currently because there is (was?) no ACK to signal that you've received the commit_sig
in the signing flow. Therefore on reconnection, nodes would simply retransmit the commit_sig
message, as there was no signaling mechanism to know that the peer had received the message.
My understanding of this change is that @TheBlueMatt has identified that the next_commitment_number
can be used as an on-reconnection signal to ACK the receipt of commit_sig
prior to disconnection. If the next_commitment_number
is set to the current index AND the next_funding_txid
is correct, that would signal to the peer to retransmit the commit_sig
. In the case where you've never received a commitment sig, the index would be set to 0
. This is new/novel behavior. I don't think any node ever sends a next_commitment_number
set to zero currently; 1 is the lowest number currently transmitted.
Normally the next_commitment_number
is set to current index plus one; being set to the current index is the signal that we were missing in the original design.
I think I'm confused as to why this is something that's only ever applicable for the origination of a channel, not something you'd also want to update for a splice? eg. wouldn't a splice be at N
or N+1
not 0
or 1
as at open?
this only matters if peers disconnect in the middle of the signing flow, which should be extremely rare for server nodes,
fwiw I disagree that rarity of occurrence should dictate handling of an edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of this change is that @TheBlueMatt has identified that the next_commitment_number can be used as an on-reconnection signal to ACK the receipt of commit_sig prior to disconnection.
Your understanding is correct ;)
We already had discussions during the dual-funding review about whether we wanted to use a signal to let our peer know whether they should retransmit commit_sig
before tx_signatures
or not, but we decided it was an optimization that wasn't worth the extra effort. But it is cleaner to avoid an unncessary retransmission, so it's probably a good idea to challenge this.
I think I'm confused as to why this is something that's only ever applicable for the origination of a channel, not something you'd also want to update for a splice?
This is indeed something we want to do for splicing as well. I was waiting for feedback from the cln
team to update the splice spec PR accordingly.
fwiw I disagree that rarity of occurrence should dictate handling of an edge case.
In that case we should simply do a phased deployment of this:
- eclair and cln release an update that accepts
next_commitment_number = 0
whennext_funding_txid
is set - afterwards, in the next update, eclair and cln start setting
next_commitment_number = 0
when they're requesting a retransmission ofcommit_sig
, but keep retransmittingcommit_sig
whennext_commitment_number = 1
andnext_funding_txid
is set - afterwards, in the next update, eclair and cln stop retransmitting
commit_sig
whennext_commitment_number = 1
How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is all the force closes this will cause. Even CLN <-> CLN channels between versions would force close because of the next_commitment_number = 0
.
Wouldn't this make more sense in a TLV field so we're not breaking existing stuff?
- if `next_funding_txid` is not set, and `next_commitment_number` is zero: | ||
- MUST immediately fail the channel and broadcast any relevant latest commitment | ||
transaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were definitely still doing it post-934, we started doing it (in addition to an error
) to get LND to FC when we ask maybe a year ago and they were still doing it then AFAIU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ziggie1984 this is the discussion we had about lnd
's behavior: when restoring from an SCB, it looks like lnd
is sending a channel_reestablish
with next_commitment_number = 0
. When that happens, does the sender need its peer to immediately force-close, or will lnd
send an error
when receiving the remote channel_reestablish
to request a force-close? The latter would be best, as it is spec compliant and wouldn't require adding this new requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LND will respond to a normal establishment msg with an error so there is no need for the other implementation to immediately force close when receiving the establishment-msg with a next_commitment_number == 0
, but I think having this code part in place does also not violate the spec am I correct ? Meaning that we can take it out of the Spec but if LND still wants to force-close if the next_commitment_number
is not equal to the local height we have for this channel then I think that's ok as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That wasn't quite the question, though - will lnd force-close a channel when it receives an error
message from its peer (as the spec mandates)? Otherwise, AFAIU, other implementations have to implement a hack to send a reestablish with next_commitment_number = 0
specifically to request the peer force-close the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triggering a force-close with a channel-reestablishment (setting next_commitment_number = 0) msg seems more robust rather then always force closing when an error is received wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can somebody form CLN look whether CLN still sends an error if there are gossip sync issue ?
I don't think this is the case, as we haven't seen such error
from our node with cln
peers.
yes
Good, thanks for clarifying that. Since the SCB-restoring node sends error
, we don't need to add this requirement @TheBlueMatt: eclair
, cln
and ldk
will force-close when receiving this error
, which is spec-compliant. If lnd
wants to immediately force-close when receiving channel_reestablish
with next_commitment_number = 0
, they may do so but that doesn't need to become standard.
Just one last question: the error sent by lnd
isn't "internal error", is it? This is the only error
on which we don't force-close, because it was a transient one on previous lnd
versions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triggering a force-close with a channel-reestablishment (setting next_commitment_number = 0) msg seems more robust rather then always force closing when an error is received wdyt ?
I disagree, and a good counter-example to that is that dual-funding has valid reasons to set next_commitment_number = 0
that does not request a force-close.
It's a much better spec to force-close on error
s. It's the responsibility of implementation to not send error
when force-closing isn't required: I don't see why that cannot be correctly implemented? Especially since we now have the warning
message for soft errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, haven't looked into dual-funding, that then needs to change on our side and we need to make it more strict.
Just one last question: the error sent by lnd isn't "internal error", is it? This is the only error on which we don't force-close, because it was a transient one on previous lnd versions...
We sent the following error string:
unable to resume channel, recovery required
There seems to be no error codes for these error msgs:
https://github.com/lightning/bolts/blob/c41536829c1461fe5109183ce8a7f88f5c81348b/01-messaging.md#the-error-and-warning-messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We sent the following error string:
Perfect, we'll definitely force-close on that one!
There seems to be no error codes for these error msgs:
Because there shouldn't need to be any error code, since the behavior should be to:
- only send
error
when you want your peer to force-close - force-close when receiving
error
, whatever it contains
But implementations have had to deviate from the spec because lnd
(and based on what you're saying, cln
as well) was sending error
s for non-fatal issues that actually didn't require a force-close, and it made people lose many sats in on-chain fees...
…umber` As pointed out in lightning/bolts#1214, when reconnecting a partially signed `interactive-tx` session, we should set `next_commitment_number` to the current commitment number if we haven't received our peer's `commit_sig`, which tells them they need to retransmit it. That's not what we're currently doing: we're currently setting this value to the next commitment number, regardless of whether or not we have received our peer's `commit_sig`. And we always retransmit our `commit_sig` if our peer is setting `next_funding_txid`, even if they have already received it. More importantly, if our peer behaves correctly and sends us the current commitment number, we will think that they're late and will halt, waiting for them to send `error`. This commit fixes that by allowing our peers to use the current commitment number when they set `next_funding_txid`. Note that this doesn't yet make us spec-compliant, but in order to guarantee backwards-compatibility, we must first deploy that change before we can start removing spurious `commit_sig` retransmissions.
We fully implement lightning/bolts#1214 to stop retransmitting `commit_sig` when our peer has already received it. We also correctly set `next_commitment_number` to let our peer know whether we have received their `commit_sig` or not.
As pointed out in lightning/bolts#1214, when reconnecting a partially signed `interactive-tx` session, we should set `next_commitment_number` to the current commitment number if we haven't received our peer's `commit_sig`, which tells them they need to retransmit it. More importantly, if our peer behaves correctly and sends us the current commitment number, we must not think that they're late and halt, waiting for them to send `error`. This commit fixes that by allowing our peers to use the current commitment number when they set `next_funding_txid`. Note that we keep retransmitting our `commit_sig` regardless of the value our peer set in `next_commitment_number`, because we need to wait for them to have an opportunity to upgrade. In a future commit we will stop sending spurious `commit_sig`.
We fully implement lightning/bolts#1214 to stop retransmitting `commit_sig` when our peer has already received it. We also correctly set `next_commitment_number` to let our peer know whether we have received their `commit_sig` or not.
Core lightning currently follows these rules during
|
It's worth noting that the proposed change (setting |
On reconnection in the middle of the dual-funding flow, if both nodes have exchanged the initial
commitment_signed
and node A had sent its (initial)tx_signatures
but node B never received them, both nodes should send achannel_reestablish
withnext_funding_txid
set and anext_commitment_number
of 1 (as they've already received the commitment transaction for commitment number 0).The spec indicates in this case that both nodes should retransmit their
commitment_signed
, however, as this is only gated onnext_funding_txid
and not thenext_commitment_number
field. This may cause implementations which assume that each newcommitment_signed
is for a new state to fail and potentially fail the channel.Instead, we should rely both the presence of
next_funding_txid
andnext_commitment_number
being zero to decide if we need to resend ourcommitment_signed
. Sadly, we cannot rely on justnext_commitment_number
as that is used to request a force-closure in a non-standard way to work around implementations not honoring theerror
message.